Improving Low-Resource Statistical Machine Translation with a Novel Semantic Word Clustering Algorithm

Authors

  • Jeff MA
  • Spyros Matsoukas
  • Richard Schwartz
Abstract

In this paper we present a non-language-specific strategy that uses large amounts of monolingual data to improve statistical machine translation (SMT) when only a small parallel training corpus is available. This strategy uses word classes derived from monolingual text data to improve word alignment quality, which generally deteriorates significantly when parallel training data is insufficient. We present a novel semantic word clustering algorithm, motivated by the word similarity metric presented in (Lin, 1998), to generate the word classes. Our clustering results show that this novel algorithm outperforms a state-of-the-art hierarchical clustering. We then designed a new procedure for using the derived word classes to improve word alignment quality. Our experiments show that the use of the word classes can recover over 90% of the loss in alignment quality caused by the limited parallel training data.
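
The abstract does not spell out the clustering algorithm itself, so the following is only a minimal sketch of the kind of similarity measure it cites as motivation, in the style of Lin (1998): two words are similar to the extent that they share informative context features. Two things here are assumptions and not taken from the paper: the features are neighbouring words tagged with their offset (Lin's original measure uses dependency triples), and all function names (`context_features`, `pmi_weights`, `lin_similarity`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_features(sentences, window=1):
    """Count (word, feature) pairs.

    Assumption: a feature is a neighbouring word tagged with its offset.
    Lin (1998) uses dependency triples; this is a simpler stand-in.
    """
    pair_counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            for offset in range(-window, window + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(sent):
                    pair_counts[(word, (offset, sent[j]))] += 1
    return pair_counts


def pmi_weights(pair_counts):
    """Weight each (word, feature) pair by its pointwise mutual information,
    keeping only pairs whose PMI is positive."""
    total = sum(pair_counts.values())
    word_totals, feature_totals = Counter(), Counter()
    for (word, feature), count in pair_counts.items():
        word_totals[word] += count
        feature_totals[feature] += count
    weights = defaultdict(dict)
    for (word, feature), count in pair_counts.items():
        pmi = math.log(count * total / (word_totals[word] * feature_totals[feature]))
        if pmi > 0:
            weights[word][feature] = pmi
    return weights


def lin_similarity(weights, w1, w2):
    """Lin-style similarity: information in shared features divided by
    the total information in both words' features."""
    f1, f2 = weights.get(w1, {}), weights.get(w2, {})
    denominator = sum(f1.values()) + sum(f2.values())
    if denominator == 0:
        return 0.0
    shared = f1.keys() & f2.keys()
    return sum(f1[f] + f2[f] for f in shared) / denominator


# Toy monolingual corpus (illustrative data, not from the paper).
corpus = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["a", "car", "drives"],
]
weights = pmi_weights(context_features(corpus))
print(lin_similarity(weights, "cat", "dog"))
print(lin_similarity(weights, "cat", "car"))
```

In this toy corpus "cat" and "dog" occur in identical contexts while "cat" and "car" share none, so the first pair scores higher. A clustering step would then group words with high pairwise similarity into classes; how the paper does this is not described in the abstract.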

Similar resources

Improving word alignment for low resource languages using English monolingual SRL

We introduce a new statistical machine translation approach specifically geared to learning translation from low resource languages, that exploits monolingual English semantic parsing to bias inversion transduction grammar (ITG) induction. We show that in contrast to conventional statistical machine translation (SMT) training methods, which rely heavily on phrase memorization, our approach focu...

Bilingual Word Spectral Clustering for Statistical Machine Translation

In this paper, a variant of a spectral clustering algorithm is proposed for bilingual word clustering. The proposed algorithm generates the two sets of clusters for both languages efficiently with high semantic correlation within monolingual clusters, and high translation quality across the clusters between two languages. Each cluster level translation is considered as a bilingual concept, whic...

Arabic-English Semantic Word Class Alignment to Improve Statistical Machine Translation

Clustering words is a widely used technique in statistical natural language processing. It requires syntactic, semantic, and contextual features. Especially, semantic clustering is gaining a lot of interest. It consists in grouping a set of words expressing the same idea or sharing the same semantic properties. In this paper, we present a new method to integrate semantic classes in a Statistica...

Paraphrasing Out-of-Vocabulary Words with Word Embeddings and Semantic Lexicons for Low Resource Statistical Machine Translation

Out-of-vocabulary (OOV) word is a crucial problem in statistical machine translation (SMT) with low resources. OOV paraphrasing that augments the translation model for the OOV words by using the translation knowledge of their paraphrases has been proposed to address the OOV problem. In this paper, we propose using word embeddings and semantic lexicons for OOV paraphrasing. Experiments conducted...

Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach

We describe a unified and coherent syntactic framework for supporting a semantically-informed syntactic approach to statistical machine translation. Semantically enriched syntactic tags assigned to the target-language training texts improved translation quality. The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet repor...

Publication date: 2011